World Cup Goalie Analysis

In this IPython notebook I take a Bayesian approach to analyzing just how good Tim Howard's record was during the World Cup compared to other goalkeepers. Specifically, I use an empirical Bayes approach.

The actual Bayesian analysis is fairly straightforward, but getting the data is not. As a result, most of this notebook deals with scraping and preparing the data. Feel free to skip ahead to the analysis if that doesn't interest you.

Data Preparation

Unless you happen to be interested in how to scrape data, you should probably skip this section.


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [216]:
% ls data


2014 FIFA World Cup Brazil™  Brazil-Croatia - Statistics - FIFA.com.html
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top attacks - FIFA.com_files/
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top attacks - FIFA.com.html
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top defending - FIFA.com_files/
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top defending - FIFA.com.html
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top goals - FIFA.com_files/
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top goals - FIFA.com.html
all_stats.json
attacking.csv
defending.csv
goals.csv
shots.csv
usa.json

In [3]:
# Grab data! First focus on World Cup data from FIFA, in tabular form

import pandas as pd

goals = pd.read_csv('data/goals.csv')
goals.head()


Out[3]:
Unnamed: 0 Goals For Goals from Inside Goals from Outside Own Goals For Open Play Goals Set Piece Goals Penalty Goal
0 BRA-GER 8 8 0 0 7 1 0
1 SUI-FRA 7 6 1 0 5 2 0
2 KOR-ALG 6 6 0 0 5 1 0
3 ESP-NED 6 6 0 0 5 1 1
4 AUS-NED 5 4 1 0 4 1 1

5 rows × 8 columns


In [217]:
goals.ix[:, 0][0:5]


Out[217]:
0    BRA-GER
1    SUI-FRA
2    KOR-ALG
3    ESP-NED
4    AUS-NED
Name: Unnamed: 0, dtype: object

In [5]:
goals.describe()


Out[5]:
Goals For Goals from Inside Goals from Outside Own Goals For Open Play Goals Set Piece Goals Penalty Goal
count 55.000000 55.000000 55.000000 55.000000 55.000000 55.000000 55.000000
mean 3.036364 2.545455 0.400000 0.090909 2.509091 0.527273 0.200000
std 1.586560 1.630930 0.564374 0.290129 1.412546 0.716332 0.403687
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.000000 1.000000 0.000000 0.000000 1.500000 0.000000 0.000000
50% 3.000000 2.000000 0.000000 0.000000 2.000000 0.000000 0.000000
75% 4.000000 3.000000 1.000000 0.000000 3.000000 1.000000 0.000000
max 8.000000 8.000000 2.000000 1.000000 7.000000 3.000000 1.000000

8 rows × 7 columns

Unfortunately this data isn't usable as-is because it combines the stats for both teams! How lame. It looks like we'll have to scrape the data from all ~55 individual games. We can do this by:

  1. Finding the round and match ID for each game from one of FIFA's statistics pages.
  2. Using these IDs to build the statistics page URL for each match.
  3. Scraping the statistics from each match page.
  4. Loading the results into a data structure suited for analysis.

In [218]:
# from bs4 import BeautifulSoup

with open('data/2014 FIFA World Cup Brazil™ - Statistics - Matches - Top attacks - FIFA.com.html', 'r') as f:
    raw_page = f.read()

# big_page_soup = BeautifulSoup(raw_page)

In [219]:
import re

example_url = 'http://www.fifa.com/worldcup/matches/round=255931/match=300186456/index.html#nosticky'
url_re = r'http://www.fifa.com/worldcup/matches/round=(\d{6})/match=(\d{9})/index.html'

game_urls = re.findall(url_re, raw_page)

len(game_urls)


Out[219]:
68

In [220]:
game_urls[:5]


Out[220]:
[('255955', '300186474'),
 ('255955', '300186490'),
 ('255957', '300186502'),
 ('255959', '300186501'),
 ('255957', '300186502')]
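Note that the regex matches contain duplicates: ('255957', '300186502') appears twice in the list above. Keying the final results by round and match ID papers over this, but the duplicate URLs still cost extra requests. A minimal order-preserving dedup sketch, using the five tuples above as sample input:

```python
# Hypothetical sample mirroring the (round_id, match_id) tuples above
game_urls = [('255955', '300186474'),
             ('255955', '300186490'),
             ('255957', '300186502'),
             ('255959', '300186501'),
             ('255957', '300186502')]

# Order-preserving deduplication: keep the first occurrence of each pair
seen = set()
unique_urls = []
for pair in game_urls:
    if pair not in seen:
        seen.add(pair)
        unique_urls.append(pair)

print(len(unique_urls))  # -> 4
```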

In [46]:
# Scrape statistics from each game url

import requests
from bs4 import BeautifulSoup

test_url = 'http://www.fifa.com/worldcup/matches/round=255931/match=300186456/statistics.html'

def extract_stats(url):
    """function to extract all statistics"""
    r = requests.get(url).text
    soup = BeautifulSoup(r)
    
#     print(soup)

    # find teams
    header = soup.find('div', 'navbar-matchheader')
    home = header.find('div', 't home').find('span', 't-nTri').text  # or use 't-nText' for full name
    away = header.find('div', 't away').find('span', 't-nTri').text
    
    # find stats
    stats_tables = soup.find('div', 'general-stats').find_all('div', 'teamstats-wholematch')

    data = {}
    
    # extract data
    for stats in stats_tables:
        # skip empty tables, then pull stats from each row
        if not stats:
            continue
        for row in stats.find_all('tr'):
            try:
                name = row.find('td', 'stats-name').text
                home_stats = row.find('td', attrs={'data-statref': 'home'}).text
                away_stats = row.find('td', attrs={'data-statref': 'away'}).text

                # append data
                data[name] = {home: home_stats, away: away_stats}

            except AttributeError:
                print('error on ' + str(row))

    return data

test_soup = extract_stats(test_url)
test_soup


Out[46]:
{u'Blocked': {u'BRA': u'3', u'CRO': u'1'},
 u'Clearances': {u'BRA': u'7', u'CRO': u'13'},
 u'Dangerous attacks': {u'BRA': u'48', u'CRO': u'38'},
 u'Deliveries in Penalty area': {u'BRA': u'10', u'CRO': u'5'},
 u'Goals': {u'BRA': u'3', u'CRO': u'0'},
 u'Off-Target': {u'BRA': u'5', u'CRO': u'6'},
 u'On-Target': {u'BRA': u'9', u'CRO': u'4'},
 u'Passes Completed': {u'BRA': u'433', u'CRO': u'284'},
 u'Saves': {u'BRA': u'3', u'CRO': u'3'},
 u'Total': {u'BRA': u'14', u'CRO': u'10'},
 u'Total attempts': {u'BRA': u'14', u'CRO': u'10'},
 u'Woodwork': {u'BRA': u'0', u'CRO': u'0'}}

In [240]:
# Loop through all games and extract stats

import time

all_stats = {}

for round_id, match_id in game_urls:
    url = ('http://www.fifa.com/worldcup/matches/round=' + 
           str(round_id) + '/match=' + str(match_id) + '/statistics.html')
    stats_dict = extract_stats(url)
    all_stats['{}:{}'.format(round_id, match_id)] = stats_dict
    time.sleep(0.1)
    
# all_stats

In [48]:
# save stats for later use

import json

with open('data/all_stats.json', 'w') as f:
    json.dump(all_stats, f)

In [222]:
# Reformat the data of interest!
# We care about a goalie's chance of saving an on-target shot,
# i.e., attempts that are on target, excluding blocked shots.

# The other thing is that because we care about goalies, we need to
# switch which team's statistics we track (a goalie faces the opponent's shots)

def get_goalie_stats(game, t1, t2):
    """Extracts savable attacks and saves for the goalie on t1"""
    saves = int(game['Saves'][t2])
    savable_attacks = int(game['On-Target'][t2]) - int(game['Blocked'][t2])
    return savable_attacks, saves

goalie_stats = {}
for ids, game in all_stats.items():
    if 'Goals' in game:
        t1, t2 = list(game['Goals'].keys())

        # accumulate (savable attacks, saves) for each team's goalie
        for team, opponent in ((t1, t2), (t2, t1)):
            savables, saves = get_goalie_stats(game, team, opponent)
            prev_savables, prev_saves = goalie_stats.get(team, (0, 0))
            goalie_stats[team] = (prev_savables + savables, prev_saves + saves)

goalie_stats


Out[222]:
{u'ALG': (30, 23),
 u'ARG': (18, 15),
 u'AUS': (19, 10),
 u'BEL': (16, 13),
 u'BIH': (16, 13),
 u'BRA': (23, 10),
 u'CHI': (22, 18),
 u'CIV': (13, 8),
 u'CMR': (24, 15),
 u'COL': (27, 22),
 u'CRC': (23, 21),
 u'CRO': (11, 5),
 u'ECU': (21, 18),
 u'ENG': (10, 5),
 u'ESP': (14, 7),
 u'FRA': (13, 10),
 u'GER': (29, 24),
 u'GHA': (17, 12),
 u'GRE': (16, 11),
 u'HON': (18, 11),
 u'IRN': (16, 11),
 u'ITA': (15, 12),
 u'JPN': (14, 8),
 u'KOR': (18, 12),
 u'MEX': (13, 10),
 u'NED': (23, 18),
 u'NGA': (25, 21),
 u'POR': (15, 7),
 u'RUS': (13, 10),
 u'SUI': (29, 22),
 u'URU': (16, 10),
 u'USA': (33, 27)}

In [223]:
# Simple save percentages (no prior)

{k:float(saves)/savables for k, (savables, saves) in goalie_stats.items()}


Out[223]:
{u'ALG': 0.7666666666666667,
 u'ARG': 0.8333333333333334,
 u'AUS': 0.5263157894736842,
 u'BEL': 0.8125,
 u'BIH': 0.8125,
 u'BRA': 0.43478260869565216,
 u'CHI': 0.8181818181818182,
 u'CIV': 0.6153846153846154,
 u'CMR': 0.625,
 u'COL': 0.8148148148148148,
 u'CRC': 0.9130434782608695,
 u'CRO': 0.45454545454545453,
 u'ECU': 0.8571428571428571,
 u'ENG': 0.5,
 u'ESP': 0.5,
 u'FRA': 0.7692307692307693,
 u'GER': 0.8275862068965517,
 u'GHA': 0.7058823529411765,
 u'GRE': 0.6875,
 u'HON': 0.6111111111111112,
 u'IRN': 0.6875,
 u'ITA': 0.8,
 u'JPN': 0.5714285714285714,
 u'KOR': 0.6666666666666666,
 u'MEX': 0.7692307692307693,
 u'NED': 0.782608695652174,
 u'NGA': 0.84,
 u'POR': 0.4666666666666667,
 u'RUS': 0.7692307692307693,
 u'SUI': 0.7586206896551724,
 u'URU': 0.625,
 u'USA': 0.8181818181818182}

In [225]:
# number of non-saves
# {k:savables - saves for k, (savables, saves) in goalie_stats.items()}

Bayesian Statistics!

Here I will model a goalie's probability of making a save as a Bernoulli random variable with a prior given by the beta distribution.

First, I will set the prior by choosing the maximum likelihood estimates of the beta distribution's parameters using all of the data. Modeling many binary outcomes with a beta conjugate prior means I am working with the beta-binomial distribution.

Then, I will look at each goalie's performance by using their results to calculate a posterior probability that they will save an on-target, non-blocked attempt.
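The conjugate update behind this is simple: with a Beta(a, b) prior and s saves out of n savable attempts, the posterior is Beta(a + s, b + n - s). A minimal sketch of the update (the numbers are illustrative placeholders, not the fitted prior):

```python
def posterior_params(a, b, savables, saves):
    """Beta-binomial conjugate update: a Beta(a, b) prior plus
    `saves` successes in `savables` trials gives a Beta posterior."""
    return a + saves, b + (savables - saves)

def beta_mean(a, b):
    """Expected value of a Beta(a, b) distribution."""
    return a / (a + b)

# Illustrative prior and data (not the fitted values below)
a_post, b_post = posterior_params(2.0, 1.0, savables=10, saves=8)
print('{} {}'.format(a_post, b_post))   # 10.0 3.0
print(beta_mean(a_post, b_post))        # 10/13, about 0.769
```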


In [65]:
total_savables = float(sum([a for a, b in goalie_stats.values()]))
total_savables


Out[65]:
610.0

In [66]:
total_saves = float(sum([b for a, b in goalie_stats.values()]))
total_saves


Out[66]:
439.0

In [226]:
# I expect the prior to have an expected mean close to the fraction of
# total saves divided by total savables

total_frac = total_saves/total_savables
total_frac


Out[226]:
0.719672131147541

In [179]:
# First, I will form the negative log-likelihood for the beta-binomial
# distribution. Luckily it has a closed-form expression

from scipy.special import betaln, gammaln, comb

def nllf(a, b):
    result = 0
    for savables, saves in goalie_stats.values():
#         result += (log(comb(savables, saves)) + betaln(saves + a, savables - saves + b) - betaln(a, b))
        # gammaln is more numerically stable than log(comb(...)) for large counts
        result += (gammaln(savables + 1) - gammaln(saves + 1) - gammaln(savables - saves + 1) + 
                   betaln(saves + a, savables - saves + b) - betaln(a, b))  
    return -result

# test
nllf(total_frac, 1.0-total_frac)


Out[179]:
105.25476226621079

In [180]:
nllf(total_saves, total_savables)


Out[180]:
185.09959480676741

In [181]:
nllf(0.5, 0.5)


Out[181]:
103.73915109562266

In [190]:
# Now we'll estimate a and b by minimizing the negative log-likelihood

from scipy.optimize import minimize

def nllf_log_args(arg_array):
    return nllf(exp(arg_array[0]), exp(arg_array[1]))

result = minimize(nllf_log_args, [log(0.5), log(0.5)], tol=10**-3)

result


Out[190]:
   status: 0
  success: True
     njev: 12
     nfev: 48
 hess_inv: array([[ 0.48088923,  0.46077401],
       [ 0.46077401,  0.45425467]])
      fun: 72.39193662170837
        x: array([ 3.16494542,  2.25544616])
  message: 'Optimization terminated successfully.'
      jac: array([  6.77108765e-05,  -1.70707703e-04])

In [227]:
# Calculate the prior parameters in more usable forms

a, b = exp(result.x)
a


Out[227]:
23.687451098207482

In [228]:
b


Out[228]:
9.5395484971122819

In [229]:
# Calculate the expected value of the prior

prior_mean = a/(a+b)
prior_mean


Out[229]:
0.71289768521693453
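A useful way to read the fitted prior: the posterior mean is a weighted average of the prior mean and a team's observed save rate, with the prior worth roughly a + b ≈ 33 pseudo-attempts. A sketch using the USA's totals from goalie_stats (33 savable attacks, 27 saves) and the fitted a and b above as assumed inputs:

```python
a, b = 23.687451098207482, 9.5395484971122819  # fitted prior parameters from above
savables, saves = 33.0, 27.0                   # USA totals from goalie_stats

prior_mean = a / (a + b)
raw_rate = saves / savables

# Posterior mean as a blend of the data and the prior, weighted by sample size
weight_data = savables / (savables + a + b)
posterior_mean = weight_data * raw_rate + (1 - weight_data) * prior_mean

# Algebraically identical to the direct conjugate form
direct = (a + saves) / (a + b + savables)
print('{:.6f} {:.6f}'.format(posterior_mean, direct))  # both about 0.765
```

The raw save rate of 0.818 gets shrunk toward the prior mean of 0.713, since 33 real attempts only slightly outweigh the ~33 pseudo-attempts in the prior.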

In [204]:
# Plot the pdf of the prior distribution
# Make a function for repeat use

from scipy.stats import beta

def bayes_goalie_chart(posterior_1_country, posterior_2_country=None, chart_title=None):
    x = arange(0,1,0.01)
    
    if posterior_1_country:
        plot(x, beta(*posteriors[posterior_1_country]).pdf(x),
             label="{} Posterior".format(posterior_1_country))
    
    if posterior_2_country is not None:
        plot(x, beta(*posteriors[posterior_2_country]).pdf(x),
             label="{} Posterior".format(posterior_2_country))
    else:
        plot(x, beta(a, b).pdf(x), label="Prior")
    
    # Other nice formatting stuff
    title(chart_title)
    legend(loc="best")
    xlabel("Probability")
    ylabel("Density")
    show()
    
bayes_goalie_chart(None, chart_title="Probability of a Save")



In [122]:
# Calculate posterior distributions

posteriors = {k:(a + saves, b + savables - saves) for k, (savables, saves) in goalie_stats.items()}
posteriors


Out[122]:
{u'ALG': (46.688210323459558, 16.539863916112218),
 u'ARG': (38.688210323459558, 12.539863916112218),
 u'AUS': (33.688210323459558, 18.539863916112218),
 u'BEL': (36.688210323459558, 12.539863916112218),
 u'BIH': (36.688210323459558, 12.539863916112218),
 u'BRA': (33.688210323459558, 22.539863916112218),
 u'CHI': (41.688210323459558, 13.539863916112218),
 u'CIV': (31.688210323459558, 14.539863916112218),
 u'CMR': (38.688210323459558, 18.539863916112218),
 u'COL': (45.688210323459558, 14.539863916112218),
 u'CRC': (44.688210323459558, 11.539863916112218),
 u'CRO': (28.688210323459558, 15.539863916112218),
 u'ECU': (41.688210323459558, 12.539863916112218),
 u'ENG': (28.688210323459558, 14.539863916112218),
 u'ESP': (30.688210323459558, 16.539863916112218),
 u'FRA': (33.688210323459558, 12.539863916112218),
 u'GER': (47.688210323459558, 14.539863916112218),
 u'GHA': (35.688210323459558, 14.539863916112218),
 u'GRE': (34.688210323459558, 14.539863916112218),
 u'HON': (34.688210323459558, 16.539863916112218),
 u'IRN': (34.688210323459558, 14.539863916112218),
 u'ITA': (35.688210323459558, 12.539863916112218),
 u'JPN': (31.688210323459558, 15.539863916112218),
 u'KOR': (35.688210323459558, 15.539863916112218),
 u'MEX': (33.688210323459558, 12.539863916112218),
 u'NED': (41.688210323459558, 14.539863916112218),
 u'NGA': (44.688210323459558, 13.539863916112218),
 u'POR': (30.688210323459558, 17.539863916112218),
 u'RUS': (33.688210323459558, 12.539863916112218),
 u'SUI': (45.688210323459558, 16.539863916112218),
 u'URU': (33.688210323459558, 15.539863916112218),
 u'USA': (50.688210323459558, 15.539863916112218)}

In [124]:
posterior_means = {k:saves/(saves + misses) for k, (saves, misses) in posteriors.items()}
posterior_means


Out[124]:
{u'ALG': 0.73840949427871994,
 u'ARG': 0.75521500461897817,
 u'AUS': 0.64502110816743319,
 u'BEL': 0.74527006977591448,
 u'BIH': 0.74527006977591448,
 u'BRA': 0.59913505449117299,
 u'CHI': 0.75483729783192954,
 u'CIV': 0.68547545717000857,
 u'CMR': 0.67603550945119228,
 u'COL': 0.75858660434208169,
 u'CRC': 0.7947668656240271,
 u'CRO': 0.6486425379514178,
 u'ECU': 0.76875697520231101,
 u'ENG': 0.66364766018648691,
 u'ESP': 0.64978745836192309,
 u'FRA': 0.72873921048223234,
 u'GER': 0.76634559089623733,
 u'GHA': 0.71052316585418474,
 u'GRE': 0.70464284576006408,
 u'HON': 0.6771328190327367,
 u'IRN': 0.70464284576006408,
 u'ITA': 0.73998829283913037,
 u'JPN': 0.67096130497966455,
 u'KOR': 0.69665336542929712,
 u'MEX': 0.72873921048223234,
 u'NED': 0.74141273531506691,
 u'NGA': 0.76746845756216797,
 u'POR': 0.63631423827989952,
 u'RUS': 0.72873921048223234,
 u'SUI': 0.73420575651376552,
 u'URU': 0.68432923375213883,
 u'USA': 0.76535836056626516}

In [215]:
# Calculate the 'rank' of how good a goalie is

ranked_posterior_means = sorted(posterior_means.values(), reverse=True)
posterior_rank = {k: ranked_posterior_means.index(v) + 1 for k, v in posterior_means.items()}
posterior_rank


Out[215]:
{u'ALG': 13,
 u'ARG': 7,
 u'AUS': 30,
 u'BEL': 9,
 u'BIH': 9,
 u'BRA': 32,
 u'CHI': 8,
 u'CIV': 22,
 u'CMR': 25,
 u'COL': 6,
 u'CRC': 1,
 u'CRO': 29,
 u'ECU': 2,
 u'ENG': 27,
 u'ESP': 28,
 u'FRA': 15,
 u'GER': 4,
 u'GHA': 18,
 u'GRE': 19,
 u'HON': 24,
 u'IRN': 19,
 u'ITA': 12,
 u'JPN': 26,
 u'KOR': 21,
 u'MEX': 15,
 u'NED': 11,
 u'NGA': 3,
 u'POR': 31,
 u'RUS': 15,
 u'SUI': 14,
 u'URU': 23,
 u'USA': 5}
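Ranks alone hide how much these posteriors overlap. As a quick check (not part of the original analysis), here is a sketch of approximate 95% credible intervals using a normal approximation to the Beta posterior, with the USA and CRC parameters copied from the posteriors output above:

```python
import math

# Posterior parameters copied from the posteriors output above
posteriors = {'USA': (50.688210323459558, 15.539863916112218),
              'CRC': (44.688210323459558, 11.539863916112218)}

def approx_credible_interval(a, b, z=1.96):
    """Approximate equal-tailed 95% interval for Beta(a, b) via a
    normal approximation (reasonable here since a and b are both large)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    half = z * math.sqrt(var)
    return mean - half, mean + half

for country, (a_c, b_c) in sorted(posteriors.items()):
    lo, hi = approx_credible_interval(a_c, b_c)
    print('{}: [{:.3f}, {:.3f}]'.format(country, lo, hi))
```

The two intervals overlap substantially, so the gap between the top-ranked goalies is well within the posterior uncertainty.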

Aww, so it turns out that according to this analysis the USA is not #1. Tim Howard was good, but it's hard to come out on top of Costa Rica, which allowed only two goals.

Let's look at a handful of countries to get a feel for the data. Why not start with Brazil, which seemed to have the worst goalkeeping performance in the World Cup?


In [207]:
bayes_goalie_chart('BRA', chart_title="Brazil's Posterior Probability of a Save")


And how does the USA's posterior compare to the prior?


In [237]:
bayes_goalie_chart('USA', chart_title="USA's Posterior Probability of a Save")


And what does it look like when we compare the two?


In [238]:
bayes_goalie_chart('USA', posterior_2_country='BRA', chart_title="USA vs BRA")


How does the USA posterior compare to Costa Rica's?


In [210]:
bayes_goalie_chart('USA', posterior_2_country='CRC', chart_title="USA vs CRC")


Pretty similar.

Since I did this analysis in the Netherlands, I thought I'd also compare it to the Netherlands' performance.


In [211]:
bayes_goalie_chart('USA', posterior_2_country='NED', chart_title="USA vs NED")


And how different are all of the goalies from each other?


In [235]:
# Use fresh names so we don't clobber the prior's a and b
for a_post, b_post in posteriors.values():
    x = arange(0, 1, 0.01)
    plot(x, beta(a_post, b_post).pdf(x), color='b', alpha=0.25)

title('All Posteriors')
show()


Again, pretty similar. Soccer seems to have pretty random outcomes to me.

Further work

  • Differentiate between different types of attacks (but where's the data?)
  • Other ways to estimate the prior

In [ ]: